Efficient Processing of Hamming-Distance-Based Similarity-Search Queries Over MapReduce

نویسندگان

  • MingJie Tang
  • Yongyang Yu
  • Walid G. Aref
  • Qutaibah M. Malluhi
  • Mourad Ouzzani
چکیده

Similarity search is crucial to many applications. Of particular interest are two flavors of the Hamming distance range query, namely, the Hamming select and the Hamming join (Hamming-select and Hamming-join, respectively). Hamming distance is widely used in approximate near neighbor search for high dimensional data, such as images and document collections. For example, using predefined similarity hash functions, high-dimensional data is mapped into one-dimensional binary codes that are, then linearly scanned to perform Hamming-distance comparisons. These distance comparisons on the binary codes are usually costly and, often involves excessive redundancies. This paper introduces a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming distance range queries. An efficient search algorithm based on the HA-index is presented. A distributed version of the HA-index is introduced and algorithms for realizing Hamming distance-select and Hamming distance-join operations on a MapReduce platform are prototyped. Extensive experiments using real datasets demonstrates that the HA-index and the corresponding search algorithms achieve up to two orders of magnitude speedup over existing stateof-the-art approaches, while saving more than ten times in memory space.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Novel Approaches to Biomolecular Sequence Indexing

In many biomolecular database applications involving string/sequence data, it is common to have similarity search in the form of near neighbor queries or nearest neighbor queries. The similarity between strings/sequences are typically measured in terms of the least costly set of allowed edit operations that transform one string/sequence to another. In this survey, we briefly describe some of th...

متن کامل

Solving Multiple Queries through a Permutation Index in GPU

Query-by-content by means of similarity search is a fundamental operation for applications that deal with multimedia data. For this kind of query it is meaningless to look for elements exactly equal to the one given as query. Instead, we need to measure dissimilarity between the query object and each database object. The metric space model is a paradigm that allows modeling all similarity searc...

متن کامل

On the inverse maximum perfect matching problem under the bottleneck-type Hamming distance

Given an undirected network G(V,A,c) and a perfect matching M of G, the inverse maximum perfect matching problem consists of modifying minimally the elements of c so that M becomes a maximum perfect matching with respect to the modified vector. In this article, we consider the inverse problem when the modifications are measured by the weighted bottleneck-type Hamming distance. We propose an alg...

متن کامل

An Effective Path-aware Approach for Keyword Search over Data Graphs

Abstract—Keyword Search is known as a user-friendly alternative for structured languages to retrieve information from graph-structured data. Efficient retrieving of relevant answers to a keyword query and effective ranking of these answers according to their relevance are two main challenges in the keyword search over graph-structured data. In this paper, a novel scoring function is proposed, w...

متن کامل

Efficient Bootsrapping and Query Adaptive Ranking for Image Search

Scalable image search based on visual similarity has been an active topic of research in recent years. State-of-the-art solutions often use hashing methods to embed high-dimensional image features into Hamming space, where search can be performed in real-time based on Hamming distance of compact hash codes. Unlike traditional metrics (e.g., Euclidean) that offer continuous distances, the Hammin...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015